Skip to content

fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream#2258

Open
Zelys-DFKH wants to merge 7 commits into
strands-agents:mainfrom
Zelys-DFKH:fix/graph-cancel-node-graceful-skip
Open

fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream#2258
Zelys-DFKH wants to merge 7 commits into
strands-agents:mainfrom
Zelys-DFKH:fix/graph-cancel-node-graceful-skip

Conversation

@Zelys-DFKH

@Zelys-DFKH Zelys-DFKH commented May 6, 2026

Copy link
Copy Markdown
Contributor

Description

When cancel_node is set in a BeforeNodeCallEvent hook, _execute_node raised RuntimeError after emitting MultiAgentNodeCancelEvent, which propagated through _execute_nodes_parallel and killed the graph. Downstream nodes never ran, the overall status became FAILED, and interrupt-and-resume workflows had no recovery path.

The fix introduces Status.SKIPPED. The bypassed node records result=None and status=Status.SKIPPED, lands in completed_nodes so downstream readiness checks are satisfied, and returns cleanly. Using Status.COMPLETED here would be wrong: a node that never ran is not completed, and conflating the two makes it impossible to distinguish "ran successfully" from "intentionally bypassed" in loop control and observability tools.

Three fixes came alongside:

  • _build_node_input filters SKIPPED deps before building dependency_results, so when all upstream deps are skipped the downstream node receives the original task with no orphaned "From node_x:" header
  • _from_dict reads the persisted NodeResult.status to restore Status.SKIPPED correctly on resume, rather than hardcoding Status.COMPLETED for everything in completed_nodes
  • GraphResult gains skipped_nodes: int; completed_nodes now counts only nodes that actually ran

Also opened #2401 for the complementary CANCELLED semantic (abort-branch, downstream does not run). That issue raises the cancel_nodeskip_node naming question; if maintainers prefer Option A, the rename can happen in this PR before it merges.

Related Issues

Resolves #2240

Documentation PR

N/A

Type of Change

Bug fix

Testing

Ran hatch run prepare. Updated test_graph_cancel_node (parametrized ×2) to assert Status.SKIPPED on both NodeResult.status and node.execution_status. Updated test_graph_cancel_node_downstream_executes to assert: downstream node executes (step_b.stream_async.assert_called_once()), step_b receives only the original task with no orphaned header, graph_result.skipped_nodes == 1, and graph_result.completed_nodes == 1.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions github-actions Bot added the size/s label May 6, 2026
@yonib05 yonib05 added area-multiagent Multi-agent related area-hooks Features or requests that might be implementable via hooks labels May 27, 2026
@yananym

yananym commented May 29, 2026

Copy link
Copy Markdown
Contributor

Thanks for working on this @Zelys-DFKH — this addresses a real pain point. I filed the original issue (#2240) and have been thinking about the design here, especially considering future graph looping support (revisiting previous nodes via edge conditions).

Proposal: Introduce two distinct statuses — SKIPPED and CANCELLED

After looking at how the TypeScript Strands SDK and LangGraph handle this, I think we should distinguish between two semantics:

1. SKIPPED — skip this node, continue downstream (what this PR does)

This is the "skip and continue" semantic. The node is intentionally bypassed, but downstream dependents should still execute. This is what my bug report asks for.

Design:

  • Introduce Status.SKIPPED enum value (not COMPLETED — a node that never ran shouldn't be marked as completed)
  • Dependency resolution treats SKIPPED the same as COMPLETED — the node counts as "satisfied" for downstream readiness checks in _find_newly_ready_nodes / _is_node_ready_for_resume
  • _build_node_input should skip the header entirely for SKIPPED deps (no empty "From node_x:" with nothing underneath)
  • reset_on_revisit clears SKIPPED nodes (same as COMPLETED) — they become eligible for execution again on loop re-entry
  • The NodeResult should not store a RuntimeError — a skipped node isn't an error. Use a lightweight marker (e.g., None result or a dedicated sentinel) so get_agent_results() cleanly returns []

The SKIPPED status is valuable because:

  • It's semantically honest — a node that never ran isn't "completed"
  • Edge conditions on outgoing edges can inspect execution_status == SKIPPED vs COMPLETED and route differently (powerful for loop control)
  • Observability — dashboards/logs can distinguish "ran successfully" from "intentionally bypassed"
  • No ambiguity with nodes that ran but legitimately returned empty output

2. CANCELLED — kill this branch gracefully (align with TS SDK)

This is what the TypeScript SDK does with beforeEvent.cancel: the node is cancelled, marked as a terminus, and downstream nodes do NOT execute. No error propagates — the graph can still complete via other branches.

This could be a separate hook field (e.g., cancel_node for cancel, skip_node for skip) or a single field with enum values ("skip" vs "cancel").

Why both are needed (especially for looping)

Consider a loop: A → B → C → (edge back to A if condition)

  • SKIPPED: On iteration 2, B is skipped because it already gathered data. C still runs, the back-edge fires, and the loop continues. This is the common case for conditional execution within loops.
  • CANCELLED: You want to abort a branch entirely — e.g., an optional side-branch that's no longer relevant. The main loop path continues via other branches.

Without the distinction, you can't express "skip but continue" vs "stop this path" — both of which are legitimate control flow decisions.

Complementary mechanism: Edge conditions with invocation_state

PR #2305 (merged) added invocation_state to edge condition evaluation, which provides an alternative way to route around nodes at the edge level. The two mechanisms are complementary:

  • Edge conditions: Control whether to traverse a path (node never enters the execution batch)
  • skip_node / cancel_node hooks: Control whether to execute a node that's already been reached (node is in the batch but gets bypassed)

Both are needed because edge conditions are evaluated before batch formation, while hooks fire after.


Happy to collaborate on the implementation if this direction makes sense. The core of your PR is right — the change from "raise RuntimeError" to "return cleanly" is exactly the fix. The suggestion is to layer proper status semantics on top.

@yonib05 yonib05 added python Pull requests that update python code bug Something isn't working labels May 29, 2026
@Zelys-DFKH

Copy link
Copy Markdown
Contributor Author

Agreed on SKIPPED vs COMPLETED. Updated; your analysis of the header-leak risk in _build_node_input was spot-on, too.

result=None, Status.SKIPPED throughout. _build_node_input filters SKIPPED deps before building dependency_results, so when all upstream deps are skipped the downstream node gets just the original task with no orphaned "From step_a:" header. _from_dict reads the persisted NodeResult.status to restore Status.SKIPPED correctly on resume. Also added skipped_nodes to GraphResult and adjusted completed_nodes to exclude SKIPPED nodes, which closes the observability gap.

SKIPPED nodes live in completed_nodes rather than a separate set. That satisfies both _find_newly_ready_nodes and _compute_ready_nodes_for_resume without touching either function. reset_on_revisit works correctly for loop re-entry, too: SKIPPED nodes are in completed_nodes, so they get cleared and become eligible for execution again on the next iteration.

edge.should_traverse is still evaluated for outgoing edges of SKIPPED nodes, which preserves the division you described: edge conditions control whether to enter a path before batch formation, hooks control whether to execute after.

CANCELLED (abort-branch semantics) is a separate concern; I'll open an issue.

@Zelys-DFKH

Copy link
Copy Markdown
Contributor Author

Opened #2401 to track CANCELLED (abort-branch semantics). It also surfaces the cancel_nodeskip_node naming question — happy to make that rename in this PR if maintainers prefer Option A before it merges.

@Zelys-DFKH Zelys-DFKH changed the title fix(graph): treat cancel_node as control flow, not fatal error fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream May 30, 2026
Comment thread strands-py/src/strands/hooks/events.py
@Zelys-DFKH Zelys-DFKH force-pushed the fix/graph-cancel-node-graceful-skip branch from 14642dc to 593da1d Compare June 5, 2026 20:26
@github-actions github-actions Bot added size/m and removed size/m labels Jun 5, 2026
@github-actions github-actions Bot added size/m and removed size/m labels Jun 5, 2026
@yananym

yananym commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Do you mind rebasing? I am asking Strands team to review

@JackYPCOnline JackYPCOnline self-assigned this Jun 5, 2026
@Zelys-DFKH Zelys-DFKH force-pushed the fix/graph-cancel-node-graceful-skip branch from 333dc08 to 504027b Compare June 5, 2026 21:01
@github-actions github-actions Bot added size/m and removed size/m labels Jun 5, 2026
if before_event.cancel_node:
cancel_message = (
before_event.cancel_node if isinstance(before_event.cancel_node, str) else "node cancelled by user"
skip_value = before_event.skip_node or before_event.cancel_node

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The skip_node or cancel_node short-circuit means if both are set, only skip_node's value is used (since it's checked first with or). This interaction isn't documented and could surprise users who set both fields.

Suggestion: Add a brief note in the BeforeNodeCallEvent docstring about precedence, e.g., "If both skip_node and cancel_node are set, skip_node takes precedence." Alternatively, you could log a warning when both are truthy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Added a note to the skip_node docstring: "Takes precedence over cancel_node when both are truthy."


total_nodes: int = 0
completed_nodes: int = 0
skipped_nodes: int = 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The GraphResult.completed_nodes field previously counted all settled nodes but now excludes skipped nodes. This is a behavioral change to an existing public field that users may be relying on (e.g., if result.completed_nodes == result.total_nodes: ... to check "all done").

Suggestion: Document this breaking change prominently in the PR description. Users who relied on completed_nodes == total_nodes to mean "graph finished all work" would now need completed_nodes + skipped_nodes == total_nodes (excluding failures/interrupts). Consider whether this warrants a note in a changelog or migration guide.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The direction is the opposite — skipped nodes are added to completed_nodes (the skip path calls self.state.completed_nodes.add(node)). Before this PR, cancel_node raised RuntimeError, so skip didn't exist as a concept; nothing assumed completed_nodes excluded skipped nodes. Callers that want to distinguish can check node.execution_status == Status.SKIPPED on individual results.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction to my previous reply: GraphResult.completed_nodes (the public int in _build_result) only counts Status.COMPLETED nodes — skipped ones are filtered by the if n.execution_status == Status.COMPLETED guard. The bot's observation about completed_nodes == total_nodes is right for new callers using skip_node: they'd need completed_nodes + skipped_nodes == total_nodes to mean "all work settled." That's what skipped_nodes is for. The backward-compat point holds: cancel_node raised RuntimeError before this PR, so no existing code relied on skip behavior and there's nothing to migrate.

)
@pytest.mark.asyncio
async def test_graph_cancel_node(cancel_node, cancel_message):
async def test_graph_cancel_node_deprecated(cancel_node, cancel_message):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The test name says "deprecated" and the docstring says "emits DeprecationWarning", but since the deprecation warning was intentionally removed from the source code (per review feedback from yananym), this test name and docstring are now inaccurate.

Suggestion: If no deprecation warning will be emitted, rename to test_graph_cancel_node_legacy or test_graph_cancel_node_backward_compat and update the docstring to say "cancel_node still works as a backward-compatible alias for skip_node".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deprecation warning wasn't removed — BeforeNodeCallEvent.__setattr__ still emits it when cancel_node is assigned a truthy value. The test names are accurate.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Assessment: Request Changes

The core fix is sound — replacing RuntimeError propagation with Status.SKIPPED is the right approach, and the dependency filtering/resume logic is well-implemented. However, there's a critical test/source mismatch that will cause CI failures.

Review Categories
  • Critical: Tests assert pytest.warns(DeprecationWarning, match="cancel_node is deprecated") in 3 test functions, but no deprecation warning is emitted anywhere in the source code. These tests will fail. Either emit the warning or remove the assertion.
  • API semantics: GraphResult.completed_nodes field is a behavioral breaking change (now excludes skipped nodes) — should be documented.
  • Naming consistency: cancel_message variable name is a leftover from the old semantics; should be skip_message to match the new API.
  • Swarm behavior: Status.FAILED on skip is inherited from the old code but deserves a comment explaining the design rationale.

Good work on the _build_node_input filtering, _from_dict status restoration, and thorough test coverage for the new skip_node path.

@Zelys-DFKH Zelys-DFKH force-pushed the fix/graph-cancel-node-graceful-skip branch from 06e6cf0 to c162369 Compare June 8, 2026 18:02
@github-actions github-actions Bot added size/m and removed size/m labels Jun 8, 2026
@github-actions github-actions Bot added size/m and removed size/m labels Jun 8, 2026
@github-actions github-actions Bot added size/m and removed size/m labels Jun 8, 2026
Replace Status.COMPLETED + RuntimeError with Status.SKIPPED + None for
nodes bypassed via cancel_node. A skipped node never ran and should not
be marked completed; downstream readiness checks treat SKIPPED the same
as COMPLETED by keeping skipped nodes in completed_nodes.

- Add Status.SKIPPED enum value
- NodeResult.result accepts None; to_dict/from_dict round-trip via
  {"type": "skipped"}; get_agent_results() returns [] for None
- _build_node_input filters SKIPPED deps before building
  dependency_results, so downstream nodes receive only the original
  task with no orphaned "From node_x:" header
- _from_dict restores Status.SKIPPED on node.execution_status via
  the persisted NodeResult.status rather than blindly setting COMPLETED
- GraphResult gains skipped_nodes counter; completed_nodes now counts
  only nodes that actually ran
- reset_on_revisit correctly clears SKIPPED nodes for loop re-entry
  (they are already in completed_nodes)
Introduce BeforeNodeCallEvent.skip_node as the preferred API for
bypassing a node in graph/swarm hooks. BeforeNodeCallEvent.cancel_node
continues to work but now emits a DeprecationWarning at the assignment
site via __setattr__, so the warning points to user code rather than
framework internals.

Introduce MultiAgentNodeSkipEvent (type: multiagent_node_skip) for the
skip signal. MultiAgentNodeCancelEvent is kept but documented as
reserved for the future abort-branch CANCELLED semantic per strands-agents#2401.

Behaviour:
- Graph: skip-and-continue — node is settled as SKIPPED, downstream
  nodes proceed normally.
- Swarm: skip-and-abort — existing abort behaviour is unchanged; swarm
  stops the current run.
- Warning fires unconditionally when cancel_node is set truthy, even
  when skip_node is also set.
- Falsy values (False, "") mean "do not skip" on both fields.
…ify semantics

The variable name cancel_message was inherited from the old cancel_node API.
With skip_node as the primary interface, rename to skip_message throughout.
Also add docstrings to GraphResult.completed_nodes and skipped_nodes to
document that completed_nodes counts only nodes that ran to completion (not
skipped nodes), and add an inline comment explaining why a skipped node in
a linear swarm sets Status.FAILED rather than Status.SKIPPED.
…ip semantics

Shorten overlong inline comment in swarm.py from 224 to 119 chars (ruff E501).
Replace the stale RST deprecated directive on MultiAgentNodeCancelEvent with a
plain note that the class is not currently emitted (reserved for issue strands-agents#2401).
Add cancel_node alias mention to MultiAgentNodeSkipEvent docstring.
Update test_cancel.py: swap multiagent_node_cancel to multiagent_node_skip,
remove the RuntimeError expectation from the graph test (graph continues after
skip rather than raising), and expect Status.COMPLETED for the graph case.
@Zelys-DFKH Zelys-DFKH force-pushed the fix/graph-cancel-node-graceful-skip branch from 1ab0a04 to 9198f8e Compare June 10, 2026 23:01
@github-actions github-actions Bot added size/m and removed size/m labels Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-hooks Features or requests that might be implementable via hooks area-multiagent Multi-agent related bug Something isn't working python Pull requests that update python code size/m

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] cancel_node in BeforeNodeCallEvent raises RuntimeError that kills the entire graph on resume

4 participants